Classification Models in Action using the Classical Titanic Dataset

Juliane Manitz

Objective & Prerequisites & Target Audience

Objective: We illustrate classification methods using the classical Titanic case study. We work up to advanced methods such as Random Forests, building the concepts from the ground up; no prior knowledge of machine learning is required.

To get the most out of this session, students should ideally have a foundational understanding of:

  • Basic Probability: Understanding of random variables and distributions.
  • Introductory Statistics: Familiarity with the concept of regression (linear models) and hypothesis testing.
  • R Programming: Basic knowledge of the R syntax to follow the code implementation.

Classification

Classification describes the process of predicting a discrete label (category) based on input data.

Real-World Applications & Impact

  • Healthcare: Medical diagnosis (e.g., “Malignant vs. Benign”) - Saving lives through early detection.

  • Finance: Credit Scoring & Fraud Detection - Assessing risk and securing global transactions.

  • Technology: Spam filtering and sentiment analysis - Curating the digital experience.

Table of Contents

  1. Classical Titanic Dataset

  2. Machine Learning Workflow

  3. Decision Tree: A non-linear, tree-structured approach that recursively splits data based on feature values to form decision rules, offering high interpretability.

  4. Random Forest: An ensemble method that builds multiple decision trees (a “forest”) and takes a majority vote for the final prediction.

  5. More Learning

Example: Classical Titanic Dataset

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg.

Key Variables:

  • Survived: 0 = No, 1 = Yes
  • Pclass: Ticket class (1st, 2nd, 3rd)
  • Sex: Male/Female
  • Age: Age in years
  • SibSp: Number of siblings/spouses aboard
  • Parch: Number of parents/children aboard
  • Fare: Passenger fare

Survival: Women and Children First?

Question: Unfortunately, there weren’t enough lifeboats for everyone on board. What sorts of people were more likely to survive?

Survival: … Or Rich People First?


Machine Learning Workflow

  1. Data Splitting & Cleaning
  2. Feature Engineering (where the “science” happens)
  3. Model Training & Selection
  4. Validation & Evaluation
  5. Deploy & Monitor


Data Splitting

The accuracy of an estimate \(\hat f(x)\) depends on reducible and irreducible error:

\[\mathrm{E}\big(y - \hat f(x)\big)^2 = \underbrace{\big[f(x) - \hat f(x)\big]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}\]

Machine Learning (ML) uses dataset splits into training and test datasets to find the optimal model.

  • Training: preprocess variables, tune (hyper-)parameters, …
  • Test: get one unbiased assessment of model performance

Data Preparation

Data Splitting

Split the dataset in two: one part for training, the other for testing.

library(tidymodels)  # rsample, recipes, parsnip, yardstick, ...
set.seed(1234)

spl <- dt |> 
  select(Survived, Pclass, Sex, Age, 
         SibSp, Parch, Fare, Embarked) |> 
  initial_split(prop = 0.8, 
                strata = Survived)

train <- training(spl)
test  <- testing(spl)

Data Cleaning

rf_rec <- 
  # Define classification variables
  recipe(Survived ~ ., data = train) |>
  # Missing value imputation
  step_impute_median(all_numeric()) |>
  # Factor handling: split into binary terms
  step_dummy(all_nominal_predictors()) |>
  # execute transformation
  prep()

# Extract imputed training data 
train_dt <- juice(rf_rec)

# Apply pre-processing to test data
test_dt  <- bake(rf_rec, new_data = test) |> 
  as.data.frame()

Decision Tree

  1. Find the optimal split that achieves the highest purity
  2. The prediction in each region is obtained by majority vote
  3. Repeat the process within each of the resulting regions
  4. Stop when a criterion is met, e.g. fewer than 5 observations in a region
  5. Apply weakest-link pruning to avoid overfitting
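The steps above can be sketched with the `rpart` package. A minimal illustration, using base R’s built-in `Titanic` contingency table (expanded to one row per passenger) instead of the course data; the variable names `tt`, `tree_fit`, and `best_cp` are ours, not part of the workflow above:

```r
library(rpart)

# Expand the built-in Titanic counts table to one row per passenger
tt <- as.data.frame(Titanic)
tt <- tt[rep(seq_len(nrow(tt)), tt$Freq), c("Class", "Sex", "Age", "Survived")]

# Steps 1-4: recursive partitioning with a minimum-region-size stopping rule
tree_fit <- rpart(Survived ~ ., data = tt, method = "class",
                  control = rpart.control(minbucket = 5))

# Step 5: weakest-link pruning at the complexity value that minimizes
# the cross-validated error in the cp table
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)
```

Printing `tree_pruned` shows the decision rules directly, which is what gives the method its interpretability.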

Ensemble Method: Bagging

Bootstrap aggregating (bagging) is a general-purpose procedure designed to improve stability and accuracy:

  • Generate \(B\) bootstrapped datasets of size \(n\) (sampling with replacement)
  • Train the method on each bootstrap sample to obtain the prediction \(\hat f^{*b}(x)\) at a point \(x\)
  • Average all the predictions at this point \(x\): \(\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x)\)
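A minimal hand-rolled sketch of this procedure, bagging `rpart` trees on the built-in `Titanic` table (for classification, averaging becomes a majority vote); `bg`, `preds`, and `bagged` are illustrative names of our own:

```r
library(rpart)

# Expand the built-in Titanic counts table to one row per passenger
bg <- as.data.frame(Titanic)
bg <- bg[rep(seq_len(nrow(bg)), bg$Freq), c("Class", "Sex", "Age", "Survived")]

B <- 25
n <- nrow(bg)
set.seed(42)

# One column per bootstrap sample b, holding \hat f^{*b}(x) for every row x
preds <- sapply(seq_len(B), function(b) {
  idx  <- sample(n, n, replace = TRUE)              # bootstrap sample of size n
  tree <- rpart(Survived ~ ., data = bg[idx, ], method = "class")
  as.character(predict(tree, bg, type = "class"))
})

# Aggregate: majority vote across the B trees (classification analogue
# of averaging the B predictions)
bagged <- apply(preds, 1, function(p) names(which.max(table(p))))
```

In practice the `ranger` engine used below does this (plus the random feature subsetting) far more efficiently.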


Random Forest

Bagging trees: construct \(B\) trees from bootstrapped datasets. Each tree is grown deep and is not pruned.

Random forests additionally select, at each candidate split, a random subset of \(m\) features (typically \(m \approx \sqrt{p}\) for classification), which decorrelates the trees.

# Model specification
rf_model <- 
  # Define model + parameters
  rand_forest(trees = 2000, mtry = 3) |> 
  set_engine("ranger", importance = "permutation") |>
  # Set binary response
  set_mode("classification")  

# Model fit
rf_fit <- rf_model |> 
  fit(Survived ~ ., data = train_dt)

Evaluate Performance

# Obtain predictions
rf_pred <- rf_fit |>
  predict(test_dt) |>
  bind_cols(test_dt)

# Evaluate predictions (yardstick)
rf_pred |> 
  conf_mat(truth = Survived, estimate = .pred_class)
          Truth
Prediction Survived Died
  Survived       42    7
  Died           27  103
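The summary statistics follow directly from the four cells of this confusion matrix; a sketch with the counts copied from the output above (`yardstick` computes the same via `accuracy()`, `sens()`, and `spec()`):

```r
# Cell counts copied from the confusion matrix above
tp <- 42   # predicted Survived, truly Survived
fn <- 27   # predicted Died,     truly Survived
fp <- 7    # predicted Survived, truly Died
tn <- 103  # predicted Died,     truly Died

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 145/179, about 0.81
sensitivity <- tp / (tp + fn)                   # recall for "Survived", about 0.61
specificity <- tn / (tn + fp)                   # about 0.94
```

Note the asymmetry: the model identifies non-survivors far more reliably than survivors.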

AUC / ROC

rf_probs <- rf_fit |>
  predict(test_dt, type = "prob") |>
  bind_cols(test_dt)

AUC <- rf_probs |> 
  roc_auc(Survived, .pred_Survived)
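The AUC has a simple interpretation: the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. A base-R sketch of this rank (Mann–Whitney) view, on toy scores of our own invention rather than the model output above:

```r
# Toy predicted survival probabilities and true labels (hypothetical numbers)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
truth  <- c(1,   1,   0,   1,   0,   0  )   # 1 = survived

pos <- scores[truth == 1]
neg <- scores[truth == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count half
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
# Here 8 of the 9 pairs are ordered correctly, so auc = 8/9
```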

Quick Knowledge Check

  1. Variable Importance: Which variable do you think was the strongest predictor across all models?
    1. Passenger Class
    2. Sex/Gender
    3. Age
    4. Port of Embarkation
  2. Model Complexity: Why might a Decision Tree be preferable to a Random Forest in a courtroom or medical setting?

Hint: Think about “Interpretability” vs. “Black Box”.

  3. The “Human” Factor: Based on our accuracy or AUC, can we ever predict survival with 100% certainty? Why or why not?

More Learning

Other Classification Approaches

  • Logistic Regression: A linear model for binary classification that estimates the probability of a data point belonging to a particular class using the sigmoid function.

  • K-Nearest Neighbor: A simple, distance-based “lazy learner” that classifies data based on the majority class of its (k) closest neighbors.

  • Naive Bayes: A probabilistic classifier based on Bayes’ Theorem.

  • Gradient Boosting: Sequential ensemble techniques that build trees to correct previous errors.

  • Support Vector Machines: Finds the optimal hyperplane to maximize the margin between different classes.

  • Artificial Neural Networks: Deep learning models built from interconnected layers of units, suited to complex pattern recognition.
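As a taste of the first alternative above, logistic regression needs no extra packages: base R’s `glm()` with a binomial family applies the sigmoid (inverse-logit) link. A sketch on the built-in `Titanic` table; `ti`, `glm_fit`, and `p_hat` are illustrative names of our own:

```r
# Expand the built-in Titanic counts table to one row per passenger
ti <- as.data.frame(Titanic)
ti <- ti[rep(seq_len(nrow(ti)), ti$Freq), c("Class", "Sex", "Age", "Survived")]

# Logistic regression: linear predictor passed through the sigmoid link
glm_fit <- glm(Survived ~ Class + Sex + Age, data = ti,
               family = binomial())

# Predicted survival probabilities, each strictly between 0 and 1
p_hat <- predict(glm_fit, type = "response")
```

The fitted coefficients (`summary(glm_fit)`) are log-odds ratios, which keeps this model nearly as interpretable as a single decision tree.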

Further Reading & Exercises

James, Witten, Hastie and Tibshirani. An Introduction to Statistical Learning (R/Python)

Kuhn and Silge. Tidy Modeling with R: A Framework for Modeling in the Tidyverse.